============================================================================
||                                                                        ||
||    MAT -- p.n.; poss. abbrev. of "Match"; also "Matte" [Motion         ||
||    Picture Arts]: means of cutting, inserting and superposing          ||
||    disparate items.                                                    ||
||                                                                        ||
||                          -------------------                           ||
||                                                                        ||
||    This program provides a flexible string-matching and substitution   ||
||    mechanism for both text and filenames. It will probably be most     ||
||    useful within command script files to extend the operations         ||
||    possible with AmigaDOS. The matching scheme is an extended          ||
||    version of the standard AmigaDOS pattern-matching convention,       ||
||    with the added features of negation and "slicing" of matched        ||
||    strings.                                                            ||
||                                                                        ||
||            * Searches for patterns within text files.                  ||
||                                                                        ||
||            * Rearranges text within matched lines to                   ||
||              create new files.                                         ||
||                                                                        ||
||            * Searches directories for matching file names.             ||
||                                                                        ||
||            * Creates Command Script Files using the whole or           ||
||              parts of matched filenames.                               ||
||                                                                        ||
||                                                                        ||
||                -- Copyright (C)1987 Peter J. Goodeve --                ||
============================================================================
-- by Pete Goodeve --
August 1987
This is one of those programs that is probably overkill for most
purposes, but on rare occasions may do something that no other program can.
It grew out of some long-time desires I've had while working with the Amiga
(...and grew... and grew...). First, I wanted to be able to search text
files for more than just simple strings: I wanted full pattern-matching
ability. Second, I wanted to be able to do things equivalent to this sort
of "impossible" command:
"RENAME myfile#? AS myfile_backup#?"
Eventually, some tortuous evolution led to a single program, "Mat",
that combines both facilities into one (because 90% of the code was
common). Along the way, I extended Richards' original matching algorithm
-- recoded in C from BCPL -- to handle negation patterns ("don't match the
string if this pattern matches"), the "*" as an optional alternative to
"#?", and the marking of "slice points" in the matched string so it can
later be cut into pieces and spliced with other text.
The resulting program is much more flexible than "Search" or any of the
file matching commands, but it is not by any means a full language like,
say, "awk". Because of its flexibility, the command line syntax can get a
little involved compared to other DOS commands, so it is probably easiest
to use within prepared script files. With a command syntax like that, of
course, Mat can't be invoked from a Workbench icon, only from a CLI.
A couple of things I should also mention at this early opportunity.
In distinction to "Search" -- which is a simple string search -- this is a
line-matching program. In other words the pattern you specify must match
the whole line; if the pattern you are looking for can occur anywhere on
the line (an "unanchored" pattern) it must be preceded and followed by
wild-cards ("#?" or "*"). You will also find that pattern matching can be
SLOWW! On a simple (unanchored) string search it is about half the speed
of "Search", and each alternation you add increases the time by about the
simple-search time: it is doing a lot of work on each character in the
file. On the other hand, anchored matches can be quite fast, because each
line scan can quit at the first failure. You may find it preferable to RUN
the program as a background process; it works nicely with "pipes" if you
have them installed.
The program can essentially operate in two different modes: 1) it can
scan text files for lines matching a specified pattern; 2) it can scan
directories for filenames matching patterns. In either case matched
strings can either be output unchanged or they can be cut up (according to
"slice marks" in the pattern) and rearranged -- possibly with added text
-- on the basis of a supplied "template". As well as pieces of the source
string, things like line-count and current filename can be spliced into the
output. (There are actually two more modes. In one you can perform the
same sort of operations on literal arguments in the command line; this is
only really intended for the rare occasion you might want to do some
complex test or operation on a command script argument. The other mode
sends the contents of the specified files in sequence to the standard
output with no processing at all -- it's a little more flexible than JOIN.)
All control is from the command line; a specific syntax -- possibly
including keywords -- determines each form.
Before getting into a complete description, some specific examples
might give a flavor of what it can do.
MAT *(word|another)* myfile
Mat scans "myfile" for lines containing "word" or "another", and displays
them on screen. (Or of course you could redirect them to a desired file
with the ">" convention.) Note the <any-string> markers ("*") surrounding
the main pattern so that it is unanchored.
MAT (W|w)ord* myfile
looks for "Word" or "word" at the beginning of a line only. The match is
normally case-sensitive -- although as we'll see later there is a keyword
to switch this off.
MAT "#?(word|another||another word)#?" myfile
In this case Mat will look for "word" and "another" as before, but will NOT
match a line containing the string "another word". The double vertical bar
is one of two ways of signifying negation. Notice that quotes around the
pattern are necessary now because of the embedded space; I also used the
alternative way of specifying <any-string> ("#?"), partly for demonstration
and also because it may be a good idea to avoid the asterisk inside quotes
as a general principle (it is itself a quoting character in certain
circumstances).
MAT *^(word|another)^* "^1: ^0----^2" myfile
This time the pattern includes "slice marks" ("^"), so it is followed by a
template indicating how the pieces are to be output. "^0" is the segment of
the line matched by the part of the pattern before the first slice mark,
"^1" is the segment between the two marks (i.e. "word" or "another",
depending on which was matched), and "^2" is the rest of the line. Thus,
if the input line was
"this line contains the word we are looking for"
the output would be
"word: this line contains the ---- we are looking for"
(By the way, in this document "^" always means the caret character -- NEVER
the control key!)
MAT *(word|another)*^ "^F: ^O" my#? :T/#?.txt
Here Mat is doing the same search as in the first example, but we have
included a slice mark to force the next argument to be a template. This
time, instead of just searching "myfile", we are looking at ALL files
beginning with "my", and also all files in the directory :T that end in
".txt". The "^F" code in the template outputs the current filename, and
"^O" ("Oh" -- not "zero") outputs the whole Original line.
MAT >RAM:rn FILES "rename ^F as ^0_old^1" #?^.c
In this final example we can see how to achieve my dream of renaming
multiple files. The output of Mat is directed to the temporary file
"RAM:rn" so that we can then EXECUTE it as a command script to do the
actual renaming. The keyword "FILES" indicates that the search is to be
for filenames, rather than the text they contain; in this syntactic form
the keyword must be followed by a template to generate the output line. In
this form, we can put slice marks into the file pattern, so no separate
main pattern is needed. Within the template, "^F" refers to the original
filename as before, and the new name will have "_old" inserted at the slice
point (immediately before ".c", as specified in the file pattern).
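To make the shape of the generated script concrete, here is a rough Python
sketch of what this one command produces. It is not part of Mat and does no
general pattern matching: the function name "rename_script" and the sample
file names are invented for illustration, and the pattern "#?^.c" is simply
hard-coded (with case ignored, as in filename searches).

```python
def rename_script(names):
    """Emit one 'rename' line per name matching #?^.c (case ignored)."""
    lines = []
    for name in names:
        if name.lower().endswith(".c"):
            stem, ext = name[:-2], name[-2:]   # the ^0 and ^1 slices
            lines.append(f"rename {name} as {stem}_old{ext}")
    return lines


print("\n".join(rename_script(["main.c", "util.C", "readme.txt"])))
```

Each printed line corresponds to one line Mat would write to "RAM:rn".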
If you got completely lost in the foregoing, please read the full
manual that follows and try again. First will be a description of the
command line syntax for the various modes, then a discussion of pattern
matching in general, followed by a description of the template format; the
last section will detail the ways you can specify filename patterns.
%%%%%%%%%%%%%%
Command Line Format
___________________
Text-line Matching Mode:
This mode searches text files for lines which match the supplied
pattern. The command format is:
MAT <pattern> <file-specifier>...
where <pattern> is any pattern that does not contain slice-marks. The
following arguments (at least one, but as many as you like) specify the
text file(s) you want to search for the pattern; they may be simple
file names (in the current directory), path names to files in another
directory, or patterns that specify sets of matching files. See the
section on File Name Patterns for the full range of possibilities.
Each line found in the text files searched that matches the pattern
will be sent to the standard output -- normally the screen, but it may
be sent to a file or device with the standard DOS redirection operator
(">").
The file being searched will normally be a standard text file, with
each line terminated with a newline character, but non-text characters
cause no problem, and it is not fatal if newlines are missing. If 256
characters are read before a newline is reached these will be treated
as a line, and the scan will resume with the character following; if
the string is output, though, it will have a newline added at the end,
unless the NOLINES keyword is used (see keywords section).
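The line-splitting rule just described can be sketched in Python as
follows. This is only an illustration of the rule as stated, not Mat's
actual code; the name "split_lines" is hypothetical, and the behavior when
a newline arrives immediately after a full 256-character run (an empty
line is produced here) is an assumption.

```python
def split_lines(data: bytes, limit: int = 256):
    """Split at newlines, or after `limit` bytes if no newline turns up."""
    lines, current = [], bytearray()
    for byte in data:
        if byte == 0x0A:                 # newline ends the line (not kept)
            lines.append(bytes(current))
            current.clear()
        else:
            current.append(byte)
            if len(current) == limit:    # over-long run: treat it as a line
                lines.append(bytes(current))
                current.clear()
    if current:                          # trailing text with no final newline
        lines.append(bytes(current))
    return lines


print([len(piece) for piece in split_lines(b"x" * 300)])
```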
By default, text searches are case-sensitive: "A" does not match "a".
You can ignore case in a search by including the NOCASE keyword in the
command line. (Keywords are not shown in the basic command format to
avoid complexity and confusion. Discussion of their general use has
its own section below.)
Value returned to the CLI:
When Mat returns to the CLI, it passes back a value of zero if it has
found at least one match. If it has found no matches at all it returns
a "Warning" value of 5. This happens in all modes, and can be tested
by a command script to see if the intended operation has been
successful. If you should just want to know if a match exists, without
needing to see any output, you can simply redirect the output to NIL:.
If Mat encounters an error which prevents it from continuing, like an
incorrectly formed pattern, it will return at once with an error code
of 20.
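The rule for the returned value is small enough to restate as a few lines
of Python. This is just a model of the three cases described above; the
function name and its parameters are hypothetical.

```python
def mat_return_code(matches: int, fatal_error: bool = False) -> int:
    """Model the value Mat passes back to the CLI."""
    if fatal_error:          # e.g. an incorrectly formed pattern
        return 20
    return 0 if matches > 0 else 5   # 5 is the "Warning" level


print(mat_return_code(3), mat_return_code(0), mat_return_code(0, True))
```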
Slice'n Splice Mode:
This mode searches text files for matching lines, but instead of simply
outputting matched lines, the lines found can be cut into pieces
according to "slice-marks" in the pattern; the output lines are built
from these and other items under the control of a template argument.
The command format is:
MAT <slice-pattern> <template> <file-specifier>...
where <slice-pattern> is any pattern that contains at least one
slice-mark ("^"). It must be followed immediately by the template
argument that determines the format of each output line; these can be
generated for both matching and non-matching source lines. File
specifiers are the same as in the previous mode.
For matched lines the template can rearrange the sliced pieces of the
text and embed other constant text or such things as line number and
current file name. For lines that don't match, the original line,
fixed text, line number and so on can be output. Whatever its
contents, each output string always ends with a newline, unless the
NOLINES keyword is used (see below). For details of the template
format see the later section on the subject.
Don't forget that if you use a template you MUST include at least
one slice mark in the pattern you supply (even if you don't actually
want to cut the matched line up). Otherwise the program will get very
confused.
Keywords:
Variations on the above formats are controlled by keywords in the
command line. In general, these may be placed either before the
pattern argument or intermixed with file specifiers; they must never be
put between a pattern and its template. The exact effect may depend on
where on the command line they are placed; in many situations you could
have several interspersed along the line. Mat always processes the
command arguments in sequence, from left to right (unlike the position
independent keywords of AmigaDOS commands).
There are eleven possible keywords in this release of Mat (two of which
are just shorthand for others): NOCASE, CASE, FILES (or F), STRING,
JOIN (or J), FIRST, ALL, NOLINES and LINE. They may be in upper or
lower case as convenient. The keywords FILES (F), STRING, and JOIN
(J), set the mode of operation of the program; it is possible, but
probably not sensible, to change the mode in the middle of a command
line; there is no keyword to restore the default text search mode. The
other keywords may be used in any mode where they are appropriate.
NOCASE causes all subsequent searches to ignore the case of pattern and
text characters. It can be put anywhere in the command line subject to
the above restrictions; file specifiers appearing before it will not be
affected.
CASE cancels the effect of a previous NOCASE on the line; as this is
the default, you probably won't need it very often.
FILES (or its shorthand alternative F) selects directory filename
search mode (see next section) rather than the default text file
search. It may change the command syntax: if it is the first argument,
it REPLACES the usual pattern argument, and MUST be followed by a
template. You may also place it after a slice-pattern/template pair;
it is even permissible to put it between file specifiers if for some
odd reason you wished to mix the two types of searches, but this is not
recommended.
STRING selects a mode where the pattern is matched against the literal
argument strings that follow on the command line. Either a simple
pattern or sliced-pattern/template pair can be used. This mode is
intended primarily for EXECUTE command scripts where you might want to
test that a supplied argument satisfied some pattern constraints, or
slice and rebuild it in some way. You could also use it from the
keyboard to watch the effect of a particular pattern on various
strings.
JOIN (or its shorthand J) needs neither pattern nor template (and both
MUST be omitted if the keyword is placed first). It causes all the
files that match the specifier arguments to be sent in sequence to the
standard output. No matching or other processing is done on the
contents (and these may be anything -- not necessarily text).
FIRST is only appropriate in text matching modes. It causes the search
of each file after it on the command line to terminate when the first
match is found. It is useful when you just want to determine which
files contain a pattern, rather than listing every occurrence. It is
compatible with templates and other options.
ALL reverses the effect of FIRST if you should need to do so within a
command line. It will probably never be needed.
NOLINES prevents the usual newline character being output after each
match. All subsequent matches will be shown on the same line unless
the template dictates otherwise. Don't forget that you will usually
want some sort of separator in the template, such as a space. It can
be used in any mode.
LINE reverses the effect of NOLINES if this has been given previously.
(Apologies for the plural/singular disparity, but it isn't quite the
inverse.) It also inserts a newline into the output at that point; you
can use it just for this if you want an extra blank line between file
specifiers.
If for some odd reason you should have to specify a pattern that is
exactly one of these keywords you can easily distinguish it by putting
parentheses around it or appending the null-string-match character "%".
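Because keywords take effect from left to right, it can help to picture
the command line being walked one argument at a time. The Python sketch
below does this for NOCASE, CASE, FIRST and ALL only; it assumes the
pattern (and any template) has already been removed, and treats every
other argument as a file specifier. The function name "scan_arguments" is
made up, and this is not how Mat itself is coded.

```python
def scan_arguments(args):
    """Pair each file specifier with the settings in force when reached."""
    ignore_case, first_only = False, False
    plan = []
    for arg in args:
        word = arg.upper()           # keywords may be typed in either case
        if word == "NOCASE":
            ignore_case = True
        elif word == "CASE":
            ignore_case = False
        elif word == "FIRST":
            first_only = True
        elif word == "ALL":
            first_only = False
        else:                        # anything else: a file specifier
            plan.append((arg, ignore_case, first_only))
    return plan


for entry in scan_arguments(["a.txt", "NOCASE", "b.txt", "first", "c.txt"]):
    print(entry)
```

Note that "a.txt" is searched with the defaults, since it appears before
either keyword.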
Filename Search Mode:
This mode searches for filenames which match the supplied specifiers;
it does not examine the contents of the files. A template is always
required to specify the form of the output. There are two command
forms for this mode:
MAT FILES <template> <file-specifier>...
MAT <slice-pattern> <template> FILES <file-specifier>...
In all cases, the abbreviation F can be used instead of FILES.
In the first form, the FILES keyword replaces the usual pattern
argument; it must be followed by a template. This simply finds all
files which match the supplied specifiers. Slice marks may be included
in the filename part of a specifier (but not in the path part); if they
are present they will be recognized by the template, but they are
always optional. In this particular form, the template does not require
that slice-marks be present. As in other filename searches, the match
never pays any attention to the case of either pattern or filename.
In the second form, both a slice-pattern and a template must be
present. Any filenames which match the specifiers will then be matched
against the main pattern and the appropriate action taken; any slice
marks in the specifiers are just ignored. There are two notable
effects of this two-stage matching. First, by default the final stage
is case-sensitive -- though the NOCASE keyword will reverse this.
Second, you can output lines for names that DON'T match in the second
stage, as well as ones that do.
Literal String Match Mode:
This mode tests string arguments in the command line against the
pattern. The match can either be simple or sliced with a template.
Possible command forms are:
MAT STRING <pattern> <string-argument>...
MAT STRING <slice-pattern> <template> <string-argument>...
The STRING keyword could also occur after the pattern or
pattern/template pair. Putting it after one or more file specifiers
would change modes in the middle of the command; it is remotely
possible that this might be useful.
Output from this mode is just as in the other modes, and as usual it
will return the value 5 to the CLI if all matches fail.
File Concatenation Mode:
With this mode, you can join multiple files into a single stream sent
to the standard output. You can use it as a multiple "TYPE" command,
or -- if you redirect the output to a file -- as a "JOIN" command that
handles patterns.
MAT JOIN <file-specifier>...
You may use J as an abbreviation for JOIN if you wish. No pattern or
template should be included. The program pays no attention to the
contents of the files: they are simply treated as byte streams.
========
Pattern Matching
________________
The pattern matching algorithm used by Mat is an extension of the
standard file pattern matching scheme used by AmigaDOS. Many people may
not appreciate how general and flexible the method is. It is many times
more capable than the simple "wild-card" matching available on most
personal computers. There are some things that the standard algorithm
doesn't have which would often be useful, and I have done my best to supply
some of these in this extended version.
The discussion that follows may be a fuller exposition of how to use
pattern matching than is available from other sources. If you leave out
references to the "universal-match" character "*", "negation matches", and
"slicing", everything discussed applies just as well to standard AmigaDOS
patterns, which can be used in commands like LIST, DELETE, and COPY.
A pattern is a text string constructed from "plain characters" and
"special characters". It represents a set (possibly a large set) of text
strings that will match it. Remember that it always matches complete
strings; this is not the same as a simple text search, where a match is
signalled if the search string is found anywhere within the source text.
The string being matched by the pattern is always "bounded" in some way,
either because it stands alone -- like a file name -- or because, say, it
is a complete line of text. The newline character at the end is not
usually available to the matching process.
If a pattern argument in a command line contains spaces, it must of
course be enclosed in quotes. Unfortunately, there is no way of including
quotes in a pattern which is itself enclosed in quotes (because of the way
C handles argument strings).
The syntax of the pattern structure is such that complex patterns can
be built from simple ones. Broadly speaking, patterns may be chained end
to end so that successive segments of a complete target string may be
matched by successive segments of the pattern. In addition, each pattern
segment can specify "alternatives": if any of these match, the whole
segment matches.
Plain Characters:
The simplest pattern is a string of plain characters. This will only
match a target string consisting of exactly the same characters in the
same order, which is obviously of limited usefulness. The only case
where you are likely to want this is when getting a particular file
name, and the program is smart enough to go directly to the file in
this case rather than doing a search.
Special Characters:
To build more general patterns we need the special characters. These do
not represent themselves (unless special action is taken): they are
instead the structural elements that give a pattern its form. Using
them we can build patterns -- or subpatterns -- that will
match, say, any single character, any five characters, any arbitrary
string, or a string that is one of several possible specific
alternatives. We can then put such subpatterns together to end up with
a complete pattern that will match all the various possibilities we are
looking for and no others. The possibilities should become clearer as
we get to specific examples.
The seven special characters used in AmigaDOS file matching are:
' ? | ( ) # and %
To these Mat adds two more:
~ and ^
We'll look at them briefly in order, before we get into a fuller
exploration:
" ' " makes the character following it into a plain character.
" ? " matches ANY single character.
" | " separates alternative patterns.
" ( " and " ) " enclose patterns used in building larger ones.
" # " causes a match to any number of repetitions of the pattern
it precedes.
" % " matches the null string when syntactically necessary.
" ~ " is one way (of two) of specifying negation.
" ^ " slices a matched string into segments.
Quoting Characters:
The single quote (" ' ") is used to turn any special character
immediately following it into a plain character. Thus to match against
an actual question mark in a target text you would include the pair
" '? " in the pattern. And of course it can quote itself.
Matching Any Character:
The question mark matches ANY single character. Thus:
???
matches "abc", "xyz", and so on, but not "ab" or "abcd".
Matching Alternatives:
The vertical bar (" | ") separates "alternatives". If any of a set of
patterns separated by bars matches the target, the match is successful.
For example:
abc|def|qwertyuiop
would match any of those three strings, but no others.
The pattern
abc|x?z
would match "abc", or any three-character string that begins with "x"
and ends with "z" (such as "xyz").
Building Patterns from Others:
The left and right parentheses can be used to enclose a pattern that
you want to match as a unit when it is part of a larger pattern. As one
example we could look for any two characters followed by "abc" or "def"
with the pattern:
??(abc|def)
Combine two or more patterns in sequence this way:
(abc|def)(xxx|yyy)
This will match "abcxxx", "abcyyy", "defxxx", and "defyyy".
Patterns can be nested as far as you like with parentheses:
a(bc|??(xx|yy))d
will match "abcd", or any six-character string beginning with "a" and
ending in "xxd" or "yyd".
Redundant parentheses do no harm. They may be useful to distinguish
patterns from other constructs.
Pattern Repetition:
The " # " character is always followed by a (sub)pattern. It will match
ANY number of (exact) repetitions of that pattern (INCLUDING zero). The
pattern may be a single letter, but if it isn't it must be enclosed in
parentheses. Thus:
#(ab)
matches "ab", "abababab", or simply an empty string. It does not match
"ababa".
The pattern to be repeated may be any legal pattern, including more
repetition constructs if you want:
#(ab|?x|#(xy)z)
will match such strings as "abab", "zxab", "qxxyxyxyxyzab", and so
on. It will NOT match "abxy".
Matching the Empty String:
The " % " character is used where you have to specify an empty ("null")
string -- normally as one of a number of alternatives. The
construction
(|abc)
is not legal; instead you must use:
(%|abc)
which will match either "abc" or the null string.
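For readers who know regular expressions, the constructs covered so far
map quite directly onto them. The following Python sketch translates the
seven standard special characters (plus Mat's "*" shorthand) into a
Python regular expression and tests whole strings against it. It is an
illustration only, not the algorithm Mat uses: the names "to_regex" and
"matches" are invented, negation ("~", "||") and slicing ("^") are
deliberately rejected, and it does not check legality the way Mat does
(for instance it will accept "(|abc)").

```python
import re


def to_regex(pattern: str) -> str:
    """Translate the standard special characters into a Python regex."""

    def item(i):
        # Parse one element starting at index i; return (regex, next index).
        if i >= len(pattern):
            raise ValueError("pattern ends where an element is expected")
        c = pattern[i]
        if c == "'":                      # quote: next character is plain
            if i + 1 >= len(pattern):
                raise ValueError("nothing follows the quote character")
            return re.escape(pattern[i + 1]), i + 2
        if c == "?":                      # any single character
            return ".", i + 1
        if c == "*":                      # Mat's shorthand for "#?"
            return ".*", i + 1
        if c == "%":                      # the null string
            return "", i + 1
        if c == "#":                      # zero or more repetitions
            inner, j = item(i + 1)
            return f"(?:{inner})*", j
        if c == "(":                      # parenthesised subpattern
            inner, j = sequence(i + 1)
            if j >= len(pattern):
                raise ValueError("missing )")
            return f"(?:{inner})", j + 1
        if c in "~^)|":
            raise ValueError(f"{c!r} is not handled by this sketch")
        return re.escape(c), i + 1        # a plain character

    def sequence(i):
        # Parse elements and "|" separators up to ")" or the end.
        parts = []
        while i < len(pattern) and pattern[i] != ")":
            if pattern.startswith("||", i):
                raise ValueError("negated alternatives are not handled here")
            if pattern[i] == "|":
                parts.append("|")
                i += 1
            else:
                piece, i = item(i)
                parts.append(piece)
        return "".join(parts), i

    regex, end = sequence(0)
    if end != len(pattern):
        raise ValueError("unbalanced )")
    return regex


def matches(pattern: str, text: str) -> bool:
    """A pattern must match the complete string, never just part of it."""
    return re.fullmatch(to_regex(pattern), text) is not None


print(matches("#(ab)", "abab"), matches("#(ab)", "ababa"))
```

The use of a whole-string match reflects the point made earlier: a
pattern is anchored at both ends unless you add "#?" or "*" yourself.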
Negated Matching:
Mat extends the basic pattern matching syntax by allowing you to
specify patterns that if matched will cause the overall match to fail.
If a negated segment is included in a pattern, and the target string
has ANY POSSIBLE match of the whole pattern that includes that segment,
the match cannot succeed. There are restrictions on negation patterns
not shared by the structures we've talked about up to now; in
particular they can't be nested -- you can't negate a negation --
although they can be inserted at any level in the pattern.
There are two ways of specifying negated patterns. The first will
match ANY string UNLESS it exactly matches the pattern; it is
constructed by prefixing the pattern by the tilde (" ~ "):
ab~(cd)e
will not match "abcde", but will match any other string that begins
with "ab" and ends with "e", such as "abxxxe", "abe", "abce", etc.
The second form is a "negated alternative", indicated by two adjacent
vertical bars (" || "). This is used when, rather than matching ANY
string that is not the negated one, you have a set of patterns you want
to match UNLESS the negated part is also matched. Thus:
a(b?d|?c?||bcd)
will match four character strings such as "abxd", "accc", "abcx", as
long as the whole string is not "abcd".
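One way to picture a negated alternative is as a veto: the string must
match one of the ordinary alternatives, and must NOT match when the
negated alternative is put in their place. The Python sketch below does
this for the single pattern above, using two hand-written regular
expressions. It is a model of the described behavior for this example
only; the names are invented and Mat does not work this way internally.

```python
import re

# a(b?d|?c?||bcd): ordinary alternatives "b?d" and "?c?", negated "bcd".
WANTED = re.compile(r"a(?:b.d|.c.)")
VETOED = re.compile(r"abcd")


def negated_match(text: str) -> bool:
    """True if a wanted alternative matches and the vetoed one does not."""
    return bool(WANTED.fullmatch(text)) and not VETOED.fullmatch(text)


for candidate in ("abxd", "accc", "abcx", "abcd"):
    print(candidate, negated_match(candidate))
```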
You can have more than one negated segment, as long as one does not
appear inside another. Thus the following sort of thing is possible
(whether it's also useful though...?):
a~bc~(de)(???||fgh||xyz)
Remember that this will be forced to fail if there is any possible
match that includes a negated section, but on the other hand the tilde
construction matches any string that is not exactly the one specified.
Thus these will succeed:
acxxx
abbbcddeabc
acdexy
and these will fail:
abcxxx
abbcdexxx
aczxsdefrgthcjxsxcxyz
Slicing the Matched String:
If it is appropriate to the function of the program, you can include
"slice marks" (the caret -- " ^ ") in your pattern to select out pieces
of the matched string that can be treated individually. The way these
pieces are accessed is not the concern of the matching procedure; in
the case of Mat, the template argument provides ways of referencing
them.
Once again there is a restriction on the use of this character that
does not apply to the others: only the first four of these marks
encountered during a match will be recorded; any after this will be
ignored. Note that this doesn't mean you can only include a maximum of
four marks; if they are inside alternatives that don't match any part
of the target string, the scan will never encounter them. You should
be sure of what you are doing, though, if you don't want to be
surprised by the program's choices. We'll return to this, and some
other points you should note about the behavior of slice marks, later.
If there is more than one possible match of the pattern to the target,
the slice will be made at the earliest possible point. Remember this
especially when you have repetitions in your pattern.
Examples:
The pattern        #?^x#?
will cut           abcdxyz
into               abcd   xyz
It will also cut   abcxxxx
into               abc    xxxx
The pattern        #?^x#?y^#?
will cut           abcxxxxyz
into               abc    xxxxy   z
The pattern        #?^#x^#?
won't cut much of anything! (because "#x" also matches the null
string.) The first two slices will simply always be empty, and
slice three will contain the whole string.
The pattern        #?^(word|another)^#?
will cut           "here is another word for you"
into               "here is " "another" " word for you"
(using quotes in this case to mark off the slices). Notice that
the cuts are made around "another" rather than "word" because the
earliest match is found.
Slice marks within alternatives can be used, as noted above, but are
tricky. Because of the way the marks are recorded internally, if two
different alternatives containing them match, both marks will be
reported but the position of one of them will be wrong (probably at the
beginning of the string). So it is best to keep the slice marks
outside of any alternation constructions (as shown in the last example
above).
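The "earliest possible point" rule in the examples above behaves much
like lazy (non-greedy) matching in regular expressions. The Python sketch
below reproduces those examples with hand-translated regular expressions:
each slice becomes one capture group and each "#?" before a mark becomes
a lazy ".*?". This is an analogy under that assumption, not Mat's own
mechanism, and the helper name "slices" is invented.

```python
import re


def slices(regex: str, text: str):
    """Return the captured pieces of a whole-string match, or None."""
    found = re.fullmatch(regex, text)
    return list(found.groups()) if found else None


print(slices(r"(.*?)(x.*)", "abcdxyz"))              # #?^x#?
print(slices(r"(.*?)(x.*?y)(.*)", "abcxxxxyz"))      # #?^x#?y^#?
print(slices(r"(.*?)(x*?)(.*)", "abc"))              # #?^#x^#?
print(slices(r"(.*?)(word|another)(.*)",             # #?^(word|another)^#?
             "here is another word for you"))
```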
Templates
_________
The Templates Mat uses to generate output lines are basically simple
text strings with "splice-markers" that indicate where the pieces of the
matched string and other items are to be inserted. The text segments of a
template can be anything you want (except a newline -- there is a marker
for this). A special marker can be used to divide the template string into
"success" and "fail" halves; the "success" part controls the format of
output lines for matches, while the "fail" part will be output for each
input string that doesn't match. Output strings are always terminated with
a newline unless the NOLINES keyword is in effect.
Each marker is a character pair: the caret (" ^ ") followed by a
selector character. Slices from the matched string are numbered -- "^0" to
"^4". Other items have identifying letters, such as "^N" for line number;
the case of these letters is important (all are currently upper case
because you are already holding down the shift key for the caret). The
success/fail divider uses the vertical bar: "^|".
Not all selectors are valid under all conditions. For example you can't
use slices in the "fail" section of a template because -- obviously --
there aren't any. Line numbers, on the other hand, are only appropriate in
text matching, not in file name matching mode. If you use a selector that
is not valid it is simply skipped over. Of course you can use any selector
more than once within a template.
If a template argument in a command line contains spaces, it must of
course be enclosed in quotes. As with a pattern, you can't include quotes
in a template which is itself enclosed in quotes: use the "^Q" selector
instead.
Slice Selectors:
As four slice marks are allowed in a pattern, there can be a maximum of
five slices of the matched string. These are selected by "^0" for the
piece from the beginning of the string to the first mark, "^1" for the
piece between the first and second, up to "^4" for the remainder of the
string beyond the fourth mark. If there are fewer than four slice
marks, the slice associated with the final existing mark extends to the
end of the string, and all higher-number pieces are empty. Thus if
there are only two marks, "^2" covers the remainder of the string, and
"^3" and "^4" are empty.
For instance, if we use this pattern, with two slice marks:
#?^word ^#?
and this template -- which will omit slice 1:
^0^2
to match and rearrange the string:
"this word will be missing"
we will end up with:
"this will be missing"
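The substitution step itself is simple, and can be sketched in Python as
below. The function takes the slices as a ready-made list (it does not do
the matching), replaces "^0" to "^4" with the corresponding piece, and
treats any slice beyond the end of the list as empty. Other selectors are
left untouched here. The name "splice" is hypothetical and this is not
Mat's own code.

```python
def splice(template: str, pieces):
    """Replace ^0..^4 with the matching slice; missing slices are empty."""
    out, i = [], 0
    while i < len(template):
        nxt = template[i + 1] if i + 1 < len(template) else ""
        if template[i] == "^" and nxt and nxt in "01234":
            n = int(nxt)
            out.append(pieces[n] if n < len(pieces) else "")
            i += 2
        else:
            out.append(template[i])      # ordinary text is copied as-is
            i += 1
    return "".join(out)


print(splice("^0^2", ["this ", "word ", "will be missing"]))
```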
Line Number Selector:
The selector characters "^N" placed in a template string will insert
the current line Number within the file being scanned at that point in
the output string. The number is always five digits, with leading
zeros visible. At the moment there is no option to suppress leading
zeros, partly because it makes it easier to line up columns and partly
because I wanted fast code there [and don't forget the laziness
syndrome, Peter...]. It can be used in both "success" and "fail"
portions of a template.
So the pattern                  #?^(word|another)^#?
and template                    ^N: ^1
would generate something like   00234: another
Index Number Selector:
The pair "^I" inserts an Index number representing a count of matches
so far. The count is kept from the beginning of the program, and is not
reset with a new file. It also works in file name matching mode. You
may use it in the "fail" section of a template, but remember it will
indicate the number of matches, not lines output. The format is the
same as "^N".
Original String Selector:
The pair "^O" ("Oh", not "zero" -- I probably should have chosen a
better one...) represents the unsliced Original string. It can be used
in both the "success" and "fail" parts of a template. Thus, to simply
put a line number in front of each matched line, you could use the
template:
^N: ^O
In File Matching mode, this selector is the same as "^F" (below).
Line Break Selector:
The pair "^B" Breaks the output line at that point with a newline
character. For instance, to output line number and slice-1 on one
line, followed by the original string on a new line, use:
^N: ^1^B^O
Quote Mark Selector:
It is not usually possible to embed quote marks in template strings
directly, so you can use the selector "^Q" to make them appear at that
point in the output line.
^0 ^Q^1^Q ^2
File Name Selector:
"^F" selects the local name of the current File (i.e. without any
directory prefix), in both text and file name matching modes.
For example, if you have a filename specifier argument (see later)
:work#?/#?.txt
which has found the file
Work Disk:work_1/sample.txt
the "^F" selector will insert
sample.txt
Directory Path Selector:
"^D" in a template will insert the path to the Directory of the current
file, as seen by Mat, based on the specifier argument it is using.
The exact form of this string depends on the way you have formed the
directory part of the file specifier argument. It does NOT always
contain a complete path from device to file. If you are just assuming
the current directory the string will be empty. If the file is in a
different directory it will show the chain of directories it has used
to reach it, using the full directory names it has found.
Some specific examples:
If we used the file specifier as in the previous section, and found the
same file:
Work Disk:work_1/sample.txt
^D would insert
Work Disk:work_1
On the other hand, if the specifier was
work#?/#?.txt
^D would give
work_1
You should especially note that if you did not use a device specifier
(":") in the first section of your specifier, yet the first directory
in the chain IS in fact a root directory, you will see a slash "/"
separator rather than the colon in the string supplied by ^D. Thus if
your specifier happened to be
/work#?/#?.txt
the Directory path would be shown as
Work Disk/work_1
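One way to picture the three cases is the Python model below. It is a guess that merely reproduces the examples given here, not a description of Mat's internals; directory_path and its arguments are invented names, and the result for a bare device with no further directories is an assumption.

```python
def directory_path(names, device_in_specifier):
    """Join the directory names Mat has traversed into a ^D-style string.

    names               -- full directory names found, outermost first
    device_in_specifier -- True if the specifier's first section used ":"
    """
    if not names:
        return ""                       # current directory: empty string
    if device_in_specifier:
        # The first name is shown as a device, followed by a colon.
        return names[0] + ":" + "/".join(names[1:])
    # Otherwise every name, even a root directory, is joined with "/".
    return "/".join(names)

print(directory_path(["Work Disk", "work_1"], True))
print(directory_path(["work_1"], False))
print(directory_path(["Work Disk", "work_1"], False))
```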
Failure Template Marker:
A simple template is only applied to strings which have been matched,
and nothing is output when there isn't a match. You can split the
template, however, into two subsections with the special success/fail
division marker "^|". The section preceding this mark is applied for a
successful match just like a simple template; the section following it
is used if the match fails. In the "fail" section, any selectors
desired can be used, except the five slices "^0" - "^4".
A simple use would be to output all lines, whether or not they matched,
but mark or rearrange the matched lines in some way. For example the
following would output them all but put a marker and index number on
each matched line (and corresponding blanks before an unmatched one):
MATCH[^I]> ^O^| ^O
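The effect of the division marker can be sketched in Python like this. It is a simplified model with invented names; only the "^I" and "^O" selectors are handled, and the fourteen blanks in the "fail" part are an assumption chosen to line up with the width of "MATCH[00003]> ".

```python
def split_template(template):
    """Return (success_part, fail_part); fail_part is None if no ^| mark."""
    success, mark, fail = template.partition("^|")
    return success, (fail if mark else None)

def render(template, original, matched, index):
    """Render one line, or return None when nothing should be output."""
    success, fail = split_template(template)
    part = success if matched else fail
    if part is None:
        return None          # simple template: unmatched lines are dropped
    # ^O goes in last so that the line's own text is not re-expanded.
    return part.replace("^I", "%05d" % index).replace("^O", original)

template = "MATCH[^I]> ^O^|" + " " * 14 + "^O"
print(render(template, "a line with the word", True, 3))
print(render(template, "some other line", False, 3))
```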
File Specifiers
_______________
The arguments in the command line you supply to specify the files that
Mat will examine are really just like those you might give to any AmigaDOS
command, but there are one or two extra features.
For text file searches you will probably most often want to specify a
single file. You do this in the usual way with either the local name of a
file in the same directory, or a path name that includes the chain of
directories needed to reach that file in another. In place of the simple
file name, you can use a pattern to match a group of files in the same
directory. Unlike in other AmigaDOS commands, this pattern can employ the
extended matching features described above ("*", "~", and "||"). Slice
marks can also be used where they are appropriate (see below).
You can also use patterns in the directory part of the specification,
in just the same way as in the filename part. (Did you know that you can
also do this in most AmigaDOS commands supporting patterns, such as
DELETE?) All the directories matching that specification will be searched
in turn. However, you cannot split a pattern across directories -- in other
words, a pattern must not include a device or directory separator (":" or
"/"). This means that a given pattern can only match directory names at a
certain "level" in the file hierarchy of the disk. Also you cannot use a
pattern in a device specifier -- these must be simple names. To search
more than one level, or more than one device, you must have more specifier
arguments.
In File Name Search mode, if you don't supply any other pattern, you may
put slice marks in the file name portion of the specifier. You cannot
place them in the directory part. Except in this particular situation,
with no main pattern present, slices in the filename will be ignored.
Examples:
These are valid file specifiers:
myfile.txt
my#?file(.txt|_bak)
df1:work/myfile
:work/myfile
/work/myfile
:(work|old)/my^#?
These are not:
df(0|1):#?/#? -- pattern in device part
df1:/#(work/)myfile -- pattern includes directory separator
:w^#?/my^#? -- slice mark in directory part
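The three restrictions can be expressed as a small Python check. This is a sketch only: check_specifier is a hypothetical name, the set of pattern characters is an assumption, and Mat itself may detect these cases quite differently.

```python
PATTERN_CHARS = set("#?()|~*")      # assumed set of pattern characters

def check_specifier(spec):
    """Return None if spec looks valid, otherwise a short reason."""
    depth = 0
    for ch in spec:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch in ":/" and depth > 0:
            return "pattern includes directory separator"
    device, colon, _ = spec.partition(":")
    if colon and PATTERN_CHARS & set(device):
        return "pattern in device part"
    # Everything up to the last ":" or "/" is the directory part.
    last_sep = max(spec.rfind(":"), spec.rfind("/"))
    if "^" in spec[:last_sep + 1]:
        return "slice mark in directory part"
    return None

for spec in ["myfile.txt", ":(work|old)/my^#?", "df(0|1):#?/#?",
             "df1:/#(work/)myfile", ":w^#?/my^#?"]:
    print(spec, "->", check_specifier(spec) or "valid")
```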
The Real World
______________
Even before it had reached its current form, this program was put to
good use a couple of times. One case needed the text match facility, the
other the filename match.
I had a documentation file for another project, written in "proff" text
formatter format, and I wanted to convert it to "troff" for typesetting on
a Un*x system. I ran it through Mat several times: first simply to locate
all the formatter commands (a simple job, because they are all lines
beginning with a period), and then to actually rewrite some of the
commands in troff form, slicing up the original commands and reusing the
appropriate parts. I was even able to generate some added commands to
create indented paragraphs and so on.
The other application also involved text files, this time an article I
had written for a newsletter. The disk I passed on to the magazine's
editor already had two versions of the text, and he had to cut it further
into pieces for the layout program. So when it came time to put Part 2 on
the same disk, I had all these old files -- which I didn't want to throw
out -- cluttering the top level of the disk. Of course I could have simply
copied them to a new directory and then deleted the originals, but for one
thing that would change their date. Better to rename them all to be in a
new directory, except that without Mat I would have had to do each one
individually. The commands I used were something like:
makedir old
mat >ram:rn f "rename ^F as old/^0_Part.1_^1" article^#?
execute ram:rn
(except that I was using Sili(Con:), so I could type simply "ram:rn",
rather than "execute ram:rn"...).
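To show what ends up in ram:rn, here is a rough Python equivalent of that Mat command. The file names are made up, rename_script is an invented helper, and the match is treated as case-sensitive, which may not be how Mat compares file names.

```python
def rename_script(names, prefix="article"):
    """Build the lines produced by "rename ^F as old/^0_Part.1_^1"
    for the specifier article^#? (slice 0 = prefix, slice 1 = the rest)."""
    lines = []
    for name in names:
        if not name.startswith(prefix):
            continue                    # file does not match article#?
        slice0, slice1 = prefix, name[len(prefix):]
        lines.append("rename %s as old/%s_Part.1_%s" % (name, slice0, slice1))
    return lines

for line in rename_script(["article.txt", "article.v2", "notes"]):
    print(line)
```

Each output line is an ordinary RENAME command, so the generated file can be run directly as a script.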
As another example, I have a little command script file I call "ref",
which searches for the pattern given as its first argument in the files
corresponding to the following ones, and prints out matching lines with the
match itself highlighted:
.K PAT,FILE1,FILE2,FILE3
Mat "#?^<PAT>^#?" "^0<esc>[1;33;40m^1<esc>[0;31;40m^2" <FILE1> <FILE2> <FILE3>
Notice that the pattern argument you supply is automatically surrounded
with universal matches and slice marks. If you aren't familiar with the
strange strings in the template, these are ANSI control sequences
recognized by the console device to change text color; <esc> is the ESCape
character. To create a script file like this of course you'll need an
editor such as EMACS that can handle <esc> as a character.
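The result of that template on one line can be imitated in Python as below. This is a simplified sketch: highlight is an invented name, the search argument is treated as a literal string rather than a pattern, and it is assumed that the first occurrence is the one that gets sliced out.

```python
ESC = "\x1b"                          # the ESCape character

def highlight(line, literal):
    """Wrap the first occurrence of literal in the colour sequences
    used by the "ref" script; return None if the line does not match."""
    before, hit, after = line.partition(literal)
    if not hit:
        return None
    return before + ESC + "[1;33;40m" + hit + ESC + "[0;31;40m" + after

print(repr(highlight("a word here", "word")))
```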
Here is a short script that is convenient for sending multiple files to
your printer; it is a little better than PRINTFILES (the 1.2 Extras
command) in that it allows filename patterns (and is configured to run
automatically in the background):
.K f1,f2,f3,f4,f5
.bra {
.ket }
;*** caution -- the next line is too long for most printers! ***
run mat >t:_pr f "cd ^Q^D^Q^Btype >prt: ^F^Becho >prt: ^Q^Q" {f1} {f2} {f3} {f4} {f5} +
execute t:_pr +
delete t:_pr
It also shows up some of the deficiencies of the current version of Mat
(as elaborated in the next section): for example it is better to CD to the
"^D" directory because it is not always possible to concatenate directory
and filename (if "^D" is a device for instance); also if you aren't careful
in specifying pathnames, "^D" may not be a proper directory identification
string -- see the discussion on "^D" in the Template section.
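For one matched file, the template in the print script expands into three script lines. The Python sketch below imitates that expansion for the four selectors involved; expand is a hypothetical helper, and the directory and file name are taken from the earlier "^D" and "^F" examples.

```python
import re

def expand(template, directory, filename):
    """Expand the ^Q, ^B, ^D and ^F selectors in a template string."""
    table = {"Q": '"', "B": "\n", "D": directory, "F": filename}
    return re.sub(r"\^([QBDF])", lambda m: table[m.group(1)], template)

template = "cd ^Q^D^Q^Btype >prt: ^F^Becho >prt: ^Q^Q"
print(expand(template, "Work Disk:work_1", "sample.txt"))
```

The quotes supplied by "^Q" are what allow a directory name containing a space, such as "Work Disk", to survive as a single argument to CD.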
Deficiencies and Prospects
__________________________
Mat is obviously missing some things in its current incarnation.
When and whether they get added will depend both on my own further needs
and moods and on your feedback. (I'd be delighted with monetary
contributions, but this is for you to use and distribute anyway...)
The matching algorithm, even though extended over the original, is
in need of a couple more options. Un*x "regular expressions" have a way
of specifying sets and ranges of characters, and while I don't think Mat
needs sets -- alternation handles this reasonably well -- some way of
specifying ranges would be a big advantage. I have often wanted to search
for "any letter", say. I think the best way to implement this would be to
define specific "range selectors" such as "any letter", "any digit", "any
upper case", and so on, probably using a convention like "\a".
The selectors available for template use could also be expanded. In
particular there is an obvious need for a true directory path, rather than
the one that Mat currently assembles from the individual directory names it
has traversed. Any other suggestions?
Keywords are easy to add, so... Oh, you noticed -- yes, well their
number did grow rather rapidly, but there are still a few I would like to
add. HEAD and TAIL would be followed by template-type arguments that would
generate output at the beginning and end respectively of each file. TITLE
would do the same thing wherever it occurred in the command line. These
would be especially useful in NOLINES mode.
The matching algorithm is slower than I would like. Rewriting it in
assembly would doubtless help, but there is also a need for a "quick match"
feature for simpler patterns closer to those that SEARCH can handle.
I have other vaguer notions, too, such as how to request multiple
levels of directories in a specifier, but I'll have to think about them
awhile.
Distribution and Copyrights
___________________________
Mat itself and this manual are copyright, but may be freely distributed
without charge. Commercial use is prohibited without the express written
permission of the author.
The matching algorithm code is public domain. It is an extension of an
algorithm in an original article by Martin Richards in "Software Practice
and Experience" 1979. The source is in fairly generic 'C'; it has only
been compiled under Lattice 3.10, but should be readily transportable.
Remarks and Suggestions to:
Peter Goodeve
3012 Deakin Street #D
Berkeley, Calif. 94705
%%%%%%%%%%%%